Gapminder is an independent Swedish foundation with no political, religious or economic affiliations, providing many datasets on the state of the world around us. We will be wrangling with the Gapminder dataset for this module, provided in the R package gapminder.
The gapminder dataset provides data for 142 countries, with values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007.
By the end of this module we will be able to produce a graph exploring income per person over the years for four different countries (US, Iran, China, and Nigeria).
library(tidyverse)
library(lubridate)
library(naniar)
library(gapminder)
gapminder
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # … with 1,694 more rows
gapminder %>% names()
## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
You can think of a tibble as an enhanced data frame. One of the main differences between data frames and tibbles is the way they print data. Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. Data frames on the other hand will print all rows, which isn’t very useful if your data frame is very big!
We see here that we have a tibble of size 1704 observations relating to 6 features. The data is in tidy format, with variables corresponding to columns, and observations corresponding to rows.
Throughout this module, we will explore a variety of R packages to help us wrangle our data and make the most sense out of it. These are just a selection out of many R packages that exist, and you might not need to use all of them in your analysis for your own data. These packages help us do the following:
gapminder %>% filter(year == "1997", continent == "Asia", lifeExp > 75)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Hong Kong, China Asia 1997 80 6495918 28378.
## 2 Israel Asia 1997 78.3 5531387 20897.
## 3 Japan Asia 1997 80.7 125956499 28817.
## 4 Kuwait Asia 1997 76.2 1765345 40301.
## 5 Singapore Asia 1997 77.2 3802309 33519.
## 6 Taiwan Asia 1997 75.2 21628605 20207.
Looking at rows corresponding to the year 1997 in Asia with life expectancy greater than 75.
gapminder %>% filter(year == "1997", continent == "Asia", lifeExp > 75) %>% select(country, lifeExp)
## # A tibble: 6 x 2
## country lifeExp
## <fct> <dbl>
## 1 Hong Kong, China 80
## 2 Israel 78.3
## 3 Japan 80.7
## 4 Kuwait 76.2
## 5 Singapore 77.2
## 6 Taiwan 75.2
Looking at only the country and life expectancy variable that matches the previous condition.
gapminder %>% filter(year == "1997", continent == "Asia", lifeExp > 75) %>% select(-lifeExp, -continent, -year)
## # A tibble: 6 x 3
## country pop gdpPercap
## <fct> <int> <dbl>
## 1 Hong Kong, China 6495918 28378.
## 2 Israel 5531387 20897.
## 3 Japan 125956499 28817.
## 4 Kuwait 1765345 40301.
## 5 Singapore 3802309 33519.
## 6 Taiwan 21628605 20207.
Using - to deselect variables.
gapminder <- gapminder %>% mutate(gdp = pop*gdpPercap)
Notice here we assigned this mutated data back to the same variable gapminder. gapminder now has a new variable gdp added, that is the product of the two existing variables pop and gdpPercap.
Showing rows by lowest to highest gdp in 2007 excluding continent, pop and gdpPercap variables.
gapminder %>% filter(year == "2007") %>% arrange(gdp) %>% head() %>% select(-pop, -gdpPercap, -continent, -year)
## # A tibble: 6 x 3
## country lifeExp gdp
## <fct> <dbl> <dbl>
## 1 Sao Tome and Principe 65.5 319014077.
## 2 Comoros 65.2 701111696.
## 3 Guinea-Bissau 46.4 852652874.
## 4 Djibouti 54.8 1033689705.
## 5 Gambia 59.4 1270911775.
## 6 Liberia 45.7 1323912407.
head(gapminder)
## # A tibble: 6 x 7
## country continent year lifeExp pop gdpPercap gdp
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 6567086330.
## 2 Afghanistan Asia 1957 30.3 9240934 821. 7585448670.
## 3 Afghanistan Asia 1962 32.0 10267083 853. 8758855797.
## 4 Afghanistan Asia 1967 34.0 11537966 836. 9648014150.
## 5 Afghanistan Asia 1972 36.1 13079460 740. 9678553274.
## 6 Afghanistan Asia 1977 38.4 14880372 786. 11697659231.
Same data but in descending order
gapminder %>% filter(year == "2007") %>% arrange(desc(gdp)) %>% head() %>% select(-pop, -gdpPercap, -continent, -year)
## # A tibble: 6 x 3
## country lifeExp gdp
## <fct> <dbl> <dbl>
## 1 United States 78.2 1.29e13
## 2 China 73.0 6.54e12
## 3 Japan 82.6 4.04e12
## 4 India 64.7 2.72e12
## 5 Germany 79.4 2.65e12
## 6 United Kingdom 79.4 2.02e12
gapminder %>% group_by(continent)
## # A tibble: 1,704 x 7
## # Groups: continent [5]
## country continent year lifeExp pop gdpPercap gdp
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 6567086330.
## 2 Afghanistan Asia 1957 30.3 9240934 821. 7585448670.
## 3 Afghanistan Asia 1962 32.0 10267083 853. 8758855797.
## 4 Afghanistan Asia 1967 34.0 11537966 836. 9648014150.
## 5 Afghanistan Asia 1972 36.1 13079460 740. 9678553274.
## 6 Afghanistan Asia 1977 38.4 14880372 786. 11697659231.
## 7 Afghanistan Asia 1982 39.9 12881816 978. 12598563401.
## 8 Afghanistan Asia 1987 40.8 13867957 852. 11820990309.
## 9 Afghanistan Asia 1992 41.7 16317921 649. 10595901589.
## 10 Afghanistan Asia 1997 41.8 22227415 635. 14121995875.
## # … with 1,694 more rows
The key difference is now the addition of Groups: continent [5] to the output. In the background, the group_by() function has essentially worked by splitting your data frame into 5 separate data frames based on the categorical variable you specify, in this case, continents.
tally() counts rows, combined with group_by() we can count how many rows there are for each continent group:
gapminder %>% group_by(continent) %>% tally()
## # A tibble: 5 x 2
## continent n
## <fct> <int>
## 1 Africa 624
## 2 Americas 300
## 3 Asia 396
## 4 Europe 360
## 5 Oceania 24
This will show how many rows have lifeExp higher than the average lifeExp for their continent group:
gapminder %>% group_by(continent) %>% filter(lifeExp > mean(lifeExp)) %>% tally()
## # A tibble: 5 x 2
## continent n
## <fct> <int>
## 1 Africa 282
## 2 Americas 176
## 3 Asia 216
## 4 Europe 189
## 5 Oceania 10
If we had just filtered rows without group_by(), it will calculate the average lifeExp for the entire gapminder dataset instead of respective continent groups:
gapminder %>% filter(lifeExp > mean(lifeExp)) %>% count(continent)
## # A tibble: 5 x 2
## continent n
## <fct> <int>
## 1 Africa 80
## 2 Americas 218
## 3 Asia 224
## 4 Europe 349
## 5 Oceania 24
This shows the average lifeExp and total gdp over years.
gapminder %>% group_by(year) %>% summarise(mean_lifeExp = mean(lifeExp), total_gdp = sum(gdpPercap*pop))
## # A tibble: 12 x 3
## year mean_lifeExp total_gdp
## <int> <dbl> <dbl>
## 1 1952 49.1 7.04e12
## 2 1957 51.5 8.90e12
## 3 1962 53.6 1.10e13
## 4 1967 55.7 1.42e13
## 5 1972 57.6 1.84e13
## 6 1977 59.6 2.23e13
## 7 1982 61.5 2.54e13
## 8 1987 63.2 3.01e13
## 9 1992 64.2 3.45e13
## 10 1997 65.0 4.10e13
## 11 2002 65.7 4.73e13
## 12 2007 67.0 5.81e13
summarise_at() provide summaries of many variables at the same time. Here we want summaries of the minimum and maximum life expectancy and total gdp.
gapminder %>%
filter(year %in% c(1977, 2002)) %>%
group_by(continent, year) %>%
summarise_at(vars(lifeExp, gdpPercap), list(~min(.), ~max(.)))
## # A tibble: 10 x 6
## # Groups: continent [5]
## continent year lifeExp_min gdpPercap_min lifeExp_max gdpPercap_max
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Africa 1977 36.8 502. 67.1 21951.
## 2 Africa 2002 39.2 241. 75.7 12522.
## 3 Americas 1977 49.9 1874. 74.2 24073.
## 4 Americas 2002 58.1 1270. 79.8 39097.
## 5 Asia 1977 31.2 371 75.4 59265.
## 6 Asia 2002 42.1 611 82 36023.
## 7 Europe 1977 59.5 3528. 76.1 26982.
## 8 Europe 2002 70.8 4604. 80.6 44684.
## 9 Oceania 1977 72.2 16234. 73.5 18334.
## 10 Oceania 2002 79.1 23190. 80.4 30688.
Graph of gdpPercap over years for US, China, Nigeria and Iran
gapminder %>%
filter(country == "United States" | country == "China" | country == "Nigeria"| country == "Iran") %>%
ggplot(aes(x = year, y = gdpPercap, colour = country))+ geom_line() + scale_y_log10() + ggtitle("gdpPercap from 1950 - 2007")
However, applying dplyr functions is not always enough to be able to get data into a form ready for visualisation.
Suppose we want to plot the following:
group_data <- gapminder %>% group_by(year) %>% summarise(mean_lifeExp = mean(lifeExp), total_gdp = sum(gdpPercap*pop))
We want to plot both the average life expectancy and total gdp as lines on the same graph. We could try and pipe the summarised data frame directly into the ggplot command, however, we have no clear way to map both average life expectancy AND total gdp to the y variable, and year on the x-axis.
group_data %>% pivot_longer(-year, names_to = "summary", values_to = "values")
## # A tibble: 24 x 3
## year summary values
## <int> <chr> <dbl>
## 1 1952 mean_lifeExp 4.91e 1
## 2 1952 total_gdp 7.04e12
## 3 1957 mean_lifeExp 5.15e 1
## 4 1957 total_gdp 8.90e12
## 5 1962 mean_lifeExp 5.36e 1
## 6 1962 total_gdp 1.10e13
## 7 1967 mean_lifeExp 5.57e 1
## 8 1967 total_gdp 1.42e13
## 9 1972 mean_lifeExp 5.76e 1
## 10 1972 total_gdp 1.84e13
## # … with 14 more rows
We now have a much longer data frame (twice as many rows as before), with now all the values stored in a variable value and the summary of these values now in summary.
This is now perfect for ggplot, as the variable value can directly be used for the y variable, and two lines can be generated using the summary variable. We now pipe this directly into ggplot:
group_data %>%
pivot_longer(-year, names_to = "summary", values_to = "values") %>%
ggplot(aes(x= year, y = values, color= summary)) + geom_line() + facet_wrap(.~summary, scales = "free_y")
First, define long_form as the long format of our data grouped by year.
long_form <- group_data %>% pivot_longer(-year, names_to = "summary", values_to = "values")
We can then convert back to wide form by telling R that the column names belong in the variable summary, and the values are in a variable called values. And tada! Our original data frame is back!
long_form %>%
pivot_wider(names_from = summary, values_from = values)
## # A tibble: 12 x 3
## year mean_lifeExp total_gdp
## <int> <dbl> <dbl>
## 1 1952 49.1 7.04e12
## 2 1957 51.5 8.90e12
## 3 1962 53.6 1.10e13
## 4 1967 55.7 1.42e13
## 5 1972 57.6 1.84e13
## 6 1977 59.6 2.23e13
## 7 1982 61.5 2.54e13
## 8 1987 63.2 3.01e13
## 9 1992 64.2 3.45e13
## 10 1997 65.0 4.10e13
## 11 2002 65.7 4.73e13
## 12 2007 67.0 5.81e13
Investigate life expectancy in Europe from 1952 - 2007
gapminder %>% filter(continent == "Asia") %>% group_by(year) %>% summarize(min_lifeExp = min(lifeExp), max_lifeExp = max(lifeExp)) %>% pivot_longer(-year, names_to = "summary", values_to = "values") %>% ggplot(aes(x = year, y = values, color = summary)) + geom_line(size = 1.25) + ggtitle("Minimum and Maximum Life Expectancies in Asia, 1950 - 2007")
select_if(is.numeric) selects columns that are of the numeric class.
library(purrr)
library(tidyr)
library(mlbench)
data(PimaIndiansDiabetes)
head(PimaIndiansDiabetes)
## pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1 6 148 72 35 0 33.6 0.627 50 pos
## 2 1 85 66 29 0 26.6 0.351 31 neg
## 3 8 183 64 0 0 23.3 0.672 32 pos
## 4 1 89 66 23 94 28.1 0.167 21 neg
## 5 0 137 40 35 168 43.1 2.288 33 pos
## 6 5 116 74 0 0 25.6 0.201 30 neg
PimaIndiansDiabetes %>%
select_if(is.numeric) %>%
pivot_longer(everything(),names_to = "variable", values_to = "value") %>%
ggplot(aes(value)) +
facet_wrap(~ variable, scales = "free") +
geom_histogram() +
theme_bw(base_size = 18)
The histograms above reveal implicit missing values in glucose, pressure, triceps, insulin, and mass (since they can’t be zero!). So we convert these explicitly to missing values.
tib <- as_tibble(PimaIndiansDiabetes)
is.zero <- function(x) { return(x==0); }
# Replace implicit missing values with NAs
dat <- tib %>%
mutate_at(which(colnames(tib)%in%c("glucose","pressure","triceps","insulin","mass")),
funs(replace(., is.zero(.), NA)))
# miss_var_summary(dat) summarise the amount of missing values in each variables.
miss_var_summary(dat)
## # A tibble: 9 x 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 insulin 374 48.7
## 2 triceps 227 29.6
## 3 pressure 35 4.56
## 4 mass 11 1.43
## 5 glucose 5 0.651
## 6 pregnant 0 0
## 7 pedigree 0 0
## 8 age 0 0
## 9 diabetes 0 0
# gg_miss_var(dat) shows the relative amounts of missingness.
gg_miss_var(dat)
# vis_miss(dat) shows snapshot of overall amount of missingness in the data
vis_miss(dat)
We can check for skewness by first looking at numerical summaries of the median and mean for the variables in our data.
dat %>%
select_if(is.numeric) %>%
pivot_longer(everything(),names_to = "variable", values_to = "value") %>%
group_by(variable) %>%
summarise(mean = mean(value, na.rm = TRUE), median = median(value, na.rm = TRUE))
## # A tibble: 8 x 3
## variable mean median
## <chr> <dbl> <dbl>
## 1 age 33.2 29
## 2 glucose 122. 117
## 3 insulin 156. 125
## 4 mass 32.5 32.3
## 5 pedigree 0.472 0.372
## 6 pregnant 3.85 3
## 7 pressure 72.4 72
## 8 triceps 29.2 29
We can also inspect the boxplots of these variables to visualise skewness, and to directly identify outliers.
dat %>%
select_if(is.numeric) %>%
pivot_longer(everything(),names_to = "variable", values_to = "value") %>%
ggplot(aes(value, x = variable)) +
facet_wrap(~ variable, scales = "free") +
geom_boxplot() +
theme_bw(base_size = 18)
We can see that outliers are present in many of our variables. But what do we do with them? First of all, it is important to note that you can expect outliers in your research and they can be a result of many different factors, including:
If outliers are a result of the first two reasons, then they should be excluded from the dataset. However, if the outliers are a result of natural variation then they should be included to ensure the integrity of the data. In fact, they may be very important to the research.